We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. We achieve this by simultaneously training depth and camera pose estimation networks using the task of view synthesis as the supervisory signal. The networks are thus coupled via the view synthesis objective during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performing comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performing favorably with established SLAM systems under comparable input settings.
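
To illustrate how view synthesis can act as a supervisory signal, the following is a minimal NumPy sketch, not the implementation described in this work. It assumes a pinhole camera with known intrinsics `K`, a per-pixel depth map for the target frame, and a 4x4 rigid transform `T_t2s` from the target to the source camera. The function names, the grayscale image layout, the nearest-neighbour sampling, and the L1 photometric error are all assumptions made for illustration. A trainable version would need a differentiable sampler, such as bilinear interpolation, so that gradients can reach both networks.

```python
import numpy as np

def project_to_source(depth, K, T_t2s):
    """Map every target pixel to (x, y) coordinates in the source view."""
    h, w = depth.shape
    # Homogeneous pixel grid of the target view, shape (3, h*w).
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    # Back-project pixels to 3D points in the target camera frame.
    cam = (np.linalg.inv(K) @ pix) * depth.ravel()
    # Move the points into the source camera frame.
    cam_h = np.vstack([cam, np.ones(h * w)])
    src = (T_t2s @ cam_h)[:3]
    # Project onto the source image plane (guarding against tiny depths).
    proj = K @ src
    xy = proj[:2] / np.clip(proj[2], 1e-6, None)
    return xy.reshape(2, h, w)

def view_synthesis_loss(target, source, depth, K, T_t2s):
    """Mean L1 error between the target and the source warped into it."""
    h, w = depth.shape
    xy = project_to_source(depth, K, T_t2s)
    # Nearest-neighbour sampling (an assumption; not differentiable).
    x = np.rint(xy[0]).astype(int)
    y = np.rint(xy[1]).astype(int)
    # Only pixels that land inside the source image contribute.
    valid = (x >= 0) & (x < w) & (y >= 0) & (y < h)
    if not valid.any():
        return 0.0
    warped = np.zeros_like(target)
    warped[valid] = source[y[valid], x[valid]]
    return float(np.abs(target - warped)[valid].mean())
```

In this sketch the loss depends on both the depth map and the pose, which is why a single view synthesis objective can couple the two estimators during training while leaving each usable on its own afterwards.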